1 Preface

The goal of this problem set is to develop some intuition about the impact of the number of nodes in the hidden layer of the neural network. We will use few simulated examples to have clear understanding of the structure of the data we are modeling and will assess how performance of the neural network model is impacted by the structure in the data and the setup of the network.

First of all, to compensate for lack of coverage on this topic in ISLR, let’s go over a couple of simple examples. We start with simulating a simple two class dataset in 2D predictor space with an outcome representative of an interaction between attributes. (Please notice that for the problems you will be working on this week you will be asked below to simulate a dataset using a different model.)

# fix seed so that narrative always matches the plots:
set.seed(1234567890)
nObs <- 1000
ctrPos <- 2
xyTmp <- matrix(rnorm(4*nObs),ncol=2)
xyCtrsTmp <- matrix(sample(c(-1,1)*ctrPos,nObs*4,replace=TRUE),ncol=2)
xyTmp <- xyTmp + xyCtrsTmp
gTmp <- paste0("class",(1+sign(apply(xyCtrsTmp,1,prod)))/2)
plot(xyTmp,col=as.numeric(factor(gTmp)),pch=as.numeric(factor(gTmp)),xlab="X1",ylab="X2")
abline(h=0)
abline(v=0)

Symbol color and shape indicate class. Typical problem that will present a problem for any approach estimating a single linear decision boundary. We used similar simulated data for the random forest (week 10) problem set.

1.1 One hidden node

We can fit simple neural network (using all default values in the call to neuralnet – notice that both covariates and outcome have to be numeric as opposed to factor) and plot its layout (allowing for its output to be included in Rmarkdown generated report actually seems to be quite painful - one has to overwrite original implementation of plot.nn with the one that doesn’t call dev.new() that is included in this Rmarkdown file with echo=FALSE – to do the same you have to include that block into your Rmarkdown file also):

### Doesn't run: "requires numeric/complex ... arguments"
### nnRes <- neuralnet(g~X1+X2,data.frame(g=gTmp,xyTmp))
nnRes <- neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp)),xyTmp))
plot(nnRes)

That shows us a model with one node in a single hidden layer (default parameters).

We can lookup actual model predictions and recalculate them from input variables (in the field covariate) and model weight and activation function (fields weights and act.fct respectively):

head(nnRes$net.result[[1]])
##          [,1]
## [1,] 1.639856
## [2,] 1.042447
## [3,] 1.639856
## [4,] 1.024622
## [5,] 1.639856
## [6,] 1.503985
cbind(rep(1,6),nnRes$act.fct(cbind(rep(1,6),nnRes$covariate[1:6,])%*%nnRes$weights[[1]][[1]]))%*%nnRes$weights[[1]][[2]]
##          [,1]
## [1,] 1.639856
## [2,] 1.042447
## [3,] 1.639856
## [4,] 1.024622
## [5,] 1.639856
## [6,] 1.503985

Notice that input parameter linear.output governs whether activation function is called on the value of the output node or not:

nnResNLO <- neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp)),xyTmp),linear.output=FALSE)
head(nnResNLO$net.result[[1]])
##           [,1]
## [1,] 0.9999903
## [2,] 0.9999904
## [3,] 0.9999903
## [4,] 0.9999905
## [5,] 0.9999903
## [6,] 0.9999904
nnResNLO$act.fct(cbind(rep(1,6),nnResNLO$act.fct(cbind(rep(1,6),nnResNLO$covariate[1:6,])%*%nnResNLO$weights[[1]][[1]]))%*%nnResNLO$weights[[1]][[2]])
##           [,1]
## [1,] 0.9999903
## [2,] 0.9999904
## [3,] 0.9999903
## [4,] 0.9999905
## [5,] 0.9999903
## [6,] 0.9999904
quantile(nnResNLO$net.result[[1]])
##        0%       25%       50%       75%      100% 
## 0.9999902 0.9999903 0.9999904 0.9999904 0.9999905

As the last statement (the quantiles of the predicted out) above shows, the use of activation function limiting predicted values to \([0;1]\) range when modeling outcome taking values outside of \([0;1]\) interval does not result in a very useful model. In this case with true outcome values constrained to \(\{1,2\}\) so that the error is minimized by predicting every outcome to be as close to \(1\) as possible.

Using binary – 0 or 1 – outcome produces more useful model when activation function is applied to the output node (linear.output=FALSE) and allows use of cross-entropy error function (often used in classification setting in combination with the activation function applied to the output layer):

nnResNLO01 <- neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp))-1,xyTmp),linear.output=FALSE)
quantile(nnResNLO01$net.result[[1]],c(0,0.1,0.25,0.5,1))
##         0%        10%        25%        50%       100% 
## 0.03744930 0.03908482 0.56598177 0.63918306 0.63925921
if ( FALSE ) {
  # from v.3.5.x(?) this started to throw an error instead of silent switch to "sse":
  nnResNLO12CE <- neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp)),xyTmp),linear.output=FALSE,err.fct="ce")
}
nnResNLO01CE <- neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp))-1,xyTmp),linear.output=FALSE,err.fct="ce")
head(nnResNLO01CE$net.result[[1]])
##            [,1]
## [1,] 0.63765805
## [2,] 0.04001203
## [3,] 0.63765803
## [4,] 0.01075720
## [5,] 0.63765804
## [6,] 0.52643332
nnResNLO01CE$act.fct(cbind(rep(1,6),nnResNLO01CE$act.fct(cbind(rep(1,6),nnResNLO01CE$covariate[1:6,])%*%nnResNLO01CE$weights[[1]][[1]]))%*%nnResNLO01CE$weights[[1]][[2]])
##            [,1]
## [1,] 0.63765805
## [2,] 0.04001203
## [3,] 0.63765803
## [4,] 0.01075720
## [5,] 0.63765804
## [6,] 0.52643332
quantile(nnResNLO01CE$net.result[[1]])
##         0%        25%        50%        75%       100% 
## 0.01073063 0.57038679 0.63728499 0.63765658 0.63765805

We can plot model output indicating class identity (left panel below) that tells us that when true outcome values are constrained to 1 or 2, sum of squared errors is used as error function and output node values are used as-is (not transformed by activation function – linear.output=TRUE by default), the majority of the points were estimated to be close to 1 or 1.6 and that majority of those estimated to be close to 1 correspond to the about half of the observations at the first level of the class factor (i.e. numerical value of 1). It is also easy to see that those with predicted value of 1.6 represent roughly 1:2 mix of observations from the first and second levels of the outcome respectively, so that \(1.6 \approx (1+2*2)/3 = 5/3\) approximately equals average of their numerical values corresponding to the levels of the factor representing them.

The nature of the model estimated by neuralnet in this (very simple!) case becomes even more intuitive if we render all points in the area encompassing our training set with model predictions and overlay training dataset on top of that (right panel below). It is immediately apparent that this model identified a line in this 2D space separating one cloud of points belonging mostly to one class from all others so that predicted values are approximately equal to the average outcome on each side of this decision boundary.

plotNNpreds2D2class <- function(inpNN,inpClassThresh,inpGrid=(-60:60)/10) {
  tmpClrPch <- as.numeric(factor(inpNN$response))
  plot(inpNN$net.result[[1]],col=tmpClrPch,pch=tmpClrPch)
  table(inpNN$net.result[[1]][,1]>inpClassThresh,inpNN$response)
  xyGridTmp <- cbind(X1=rep(inpGrid,length(inpGrid)),X2=sort(rep(inpGrid,length(inpGrid))))
  # before predict.nn existed:
  #gridValsTmp <- compute(inpNN,xyGridTmp)$net.result
  gridValsTmp <- predict(inpNN,xyGridTmp)
  errTmp <- sum(inpNN$err.fct(inpNN$net.result[[1]][,1],inpNN$response))
  plot(xyGridTmp,col=as.numeric(gridValsTmp>inpClassThresh)+1,pch=20,cex=0.3,main=paste("Error:",round(errTmp,6)))
  points(inpNN$covariate,col=tmpClrPch,pch=tmpClrPch)
  ## Equations defining decision boundary:
  ## 1*w0 + X1*w1 + X2*w2 = 0, i.e.:
  ## 0 = inpNN$weights[[1]][1]+inpNN$weights[[1]][2]*X1+inpNN$weights[[1]][3]*X2, i.e:
  ## X2 = (-inpNN$weights[[1]][1] - inpNN$weights[[1]][2]*X1) / inpNN$weights[[1]][3]
  for ( iTmp in 1:ncol(inpNN$weights[[1]][[1]]) ) {
    abline(-inpNN$weights[[1]][[1]][1,iTmp] / inpNN$weights[[1]][[1]][3,iTmp], -inpNN$weights[[1]][[1]][2,iTmp] /inpNN$weights[[1]][[1]][3,iTmp],lty=2,lwd=2)
  }
}
old.par <- par(mfrow=c(1,2),ps=16)
plotNNpreds2D2class(nnRes,1.3)

par(old.par)

Similar, aside from a different position of decision boundary, results can be obtained when using binary representation of the outcome with cross-entropy as error function together with applying activation function to the output layer:

plot(nnResNLO01CE)

Once activation function is applied to the output node, the output values are bound to the \([0,1]\) interval. The predicted values that are far enough from the decision boundary are also approximately set to the average of the outcome in that subspace:

old.par <- par(mfrow=c(1,2),ps=16)
plotNNpreds2D2class(nnResNLO01CE,0.5)

par(old.par)

The important points resulting from the results shown in the figures above are the following:

  • as simple of a model as the one that was employed here (with one node in a single hidden layer along with all other default parameters) cannot do much better than what we observed here
  • because calling (default - logistic) activation function on a given linear combination of the input variables more or less amounts to assigning almost all points on one side of hyperplane (line in 2D, plane in 3D, etc.) to zero and on the other side – to unity
  • weights involved in transforming outcome of the hidden layer into model predictions will change those zeroes and ones to values closer to the desired outcome values, but still, use of such a simple model (with a single hidden node) employed here to prime our intuition more or less amounts to splitting covariate space into two half spaces by a hyperplane and assigning almost constant outcomes to the vast majority of the points on either side of it
  • the weights for the inputs into the single hidden node shown in the network layout plot above and stored in weights field of the result returned by neuralnet define this hyperplane (line in 2D, etc.) shown as dashes in the panels on the right above
  • this hyperplane is where sum of weighted input variables and an intercept is identical zero (and thus the result of logistic activation function is 0.5 rapidly becoming zero or one for points further away from this boundary)

1.2 Two hidden nodes, single hidden layer

Now, let’s add another node to the hidden layer of this network. From the above, we know what to expect as a result of that – another hyperplane (line in 2D) will be added to the space of covariates now dividing it into (depending on whether those hyperplans are almost parallel or not) three or four subspaces, consequently, assigning most of the points to three or four potentially different constants. Clearly, this level of granularity could suffice for developing a model that would do quite well in our toy example.

To have more than one node in the hidden layer we set hidden parameter to the number of nodes in it (length of vector provided as hidden parameter governs the number of the hidden layers in the network – we still use one layer here):

set.seed(1234567)
nnRes2 <- neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp)),xyTmp),hidden=2)
plot(nnRes2)

We can see that now resulting network has two nodes in a single hidden layer, which two covariates enter with weights that are approximately comparable in magnitude and opposite in sign. Their comparably weighted sum added to a constant close to one gives the outcome value of this model. The effect of those weights in defining decision boundaries in the space of predictors is best seen from the figure below:

old.par <- par(mfrow=c(1,2),ps=16)
plotNNpreds2D2class(nnRes2,1.5)

par(old.par)

This model sets up two almost parallel lines that encompass most of the observations from the first class, leaving most of the observations from the second class outside of the resulting slab. Now let’s repeat fitting neural network three times (each time starting with random choice of starting weights in the model) and compare stability of the resulting models:

old.par <- par(mfcol=c(2,3),ps=16)
for ( iTry in 1:3 ) {
  nnRes2 <- neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp)),xyTmp),hidden=2)
  plotNNpreds2D2class(nnRes2,1.5)
}

par(old.par)

We can see that quite frequently given the parameters used the process converges to a suboptimal solution with about half of the observations remaining in “gray” zone where their assignment to either of the classes is not immediately apparent.

Aside from the multitude of local minima for neural network fitting procedure that could prevent it from finding better solutions, the main point to take from this exercise is that adding more nodes to the hidden layer (with all other default choices employed here) amounts to adding more hyperplanes bisecting the space of predictors, creating more and more subspaces where the outcome can take different values (often close to a constant in each subspace). Obviously, the geometry of the resulting decision surfaces can become quite complicated even with modest number of nodes in the hidden layer. Lastly, these considerations provide some intuition for considering what could be a useful number of hidden nodes in the model. In thinking about that it might be useful to consider how many such hyperplanes could be sufficient to effectively separate observations belonging to different outcome categories. Not that we necessarily would have such knowledge ahead of time, but this might prove to be a complementary way to think about the problem in addition to the often sited empirical guidelines that are based on the number of predictor variables, etc.

The point of this problem set is to assess how these aspects of neural network fitting play out in another simulated dataset.

Lastly, we also would like you to consider what combination of error function and output node transformation you would like to use for this week problem set. Below are three calls of neuralnet timed using different combinations of err.fct and linear.output:

system.time(invisible(neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp))-1,xyTmp),hidden=2,linear.output = TRUE,err.fct="sse")))
##    user  system elapsed 
##   0.112   0.007   0.121
system.time(invisible(neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp))-1,xyTmp),hidden=2,linear.output = FALSE,err.fct="sse")))
##    user  system elapsed 
##   1.501   0.084   1.585
system.time(invisible(neuralnet(g~X1+X2,data.frame(g=as.numeric(factor(gTmp))-1,xyTmp),hidden=2,linear.output = FALSE,err.fct="ce")))
## Warning: Algorithm did not converge in 1 of 1 repetition(s) within the stepmax.
##    user  system elapsed 
##  10.836   0.630  11.465

Obviously, for our toy example use of untransformed outcome from the output node and sum of squares as error function results in the fastest convergence. But then the decision as to how to translate its predictions to class categories is yours. On the other hand, one might argue that more principled approach since we are dealing with classification problem would be to use logistic transform of the outcome to contain it within \([0;1]\) interval and cross-entropy as an error function. Except that it seems to run nearly two orders of magnitude slower in this case and does not necessarily converge with default values for the rest of the arguments. Please feel free to experiment with how they will perform for the problems you are presented with below and decide for yourself which combination of these parameters is more suitable for the task at hand.

2 Problem 1 (10 points): generate 3D data with cube as a decision surface

Simulate data with n=1000 observations and p=3 covariates – all random values from standard normal distribution (\(\mathcal{N}\left(0,1\right), \mu=0,\sigma=1\)). Create two category class variable assigning all observations within a cube centered at \((0,0,0)\) with the edge length of \(2.5\) (i.e. with vertices at \((1.25,1.25,1.25)\), \((1.25,1.25,-1.25)\), …, \((-1.25,-1.25,-1.25)\)) – to one class category and observations outside of this cube – to the second class category. Confirm that the simulated observations are nearly evenly split between the two classes.

Please note that this dataset is entirely different from the one used in the preface – you will need to write code simulating it on your own. Somewhat related 2D dataset was used as a motivational example at week 12 (SVM) lecture before introducing kernels and SVMs. However, the example in the lecture was in 2D (three-dimensional problem here) and had a circular (in 3D – spherical) boundary (here we work with a cube as a decision surface). Since you will be reusing this code in the following two problems it is probably best to turn this procedure into a function with appropriate parameters.

Ten points available for this problem are composed of accomplishing the following tasks:

  1. Correct implementation of the function generating data as prescribed above (3 points)

    #simulate n = 1000obs, p = 3 covariates
    set.seed(4123)
    
    gendata = function(n){
      #generate data points
      x = rnorm(n)
      y = rnorm(n)
      z = rnorm(n)
    
      #classify points
      cl = numeric(n)
      cl[between(x, -1.25, 1.25) == F] = 1 # -1.25 > x, y, or z > 1.25 = outside of boundary
      cl[between(y, -1.25, 1.25) == F] = 1
      cl[between(z, -1.25, 1.25) == F] = 1
    
      df = data.frame(x, y, z, cl)
    
      return(df)
    }
    
    data = gendata(1000)
  2. Check and demonstrate numerically that the resulting class assignment splits these observations, subject to sampling variability, evenly between these two groups (3 points)

    class = as.factor(data$cl)
    summary(class)
    ##   0   1 
    ## 464 536
    barplot(summary(class), names.arg = c('within', 'outside'), main = 'class distribution', ylab = 'n', col = 'pink')

  3. Plot values of the resulting covariates projected at each pair of the axes indicating classes to which observations belong with symbol color and/or shape (you can use function pairs, for example) (2 points)

    scatterplot3d(data$x, data$y, data$z, color = data$cl + 1)

  4. Reflect on the geometry of this problem by answering the following question: what is the smallest number of planes in 3D space that would completely enclose points from the “inner” class? Is this number equal to the number of cube faces or is it something smaller? Larger? Please note that “enclose” above does not mean “perfectly discriminate between the points assigned to two classes” (2 points)

    4 planes, in the shape of a pyramid, would also be able to completely enclose the points from the inner class, which is smaller than the number of cube faces.

3 Extra 5 points problem: boundary for equal split

The boundary for assigning observations to the inner class used above – \(\max_i |X_i| \leq 1.25,i=\{1,2,3\}\) – splits observations from standard normal in 3D nearly evenly between the two classes, but not quite exactly 50/50. Please derive mathematical expression for the value to be used instead of 1.25 in the expression above to contain exactly half of probability density within the inner cube, present its numerical value to the 9-th decimal place and demonstrate numerically that it results in closer to equal split of the simulated observation between the two classes than \(1.25\).

Please note that this is not a problem of devising an algorithm equally splitting a given dataset in two, but purely probabilistic/mathematical question – what threshold to use for a cube size above that will split such dataset exactly evenly on average / in the limit of infinitely large sample size (i.e. that contains precisely half of the probability density of standard normal distribution in 3D).

4 Problem 2 (20 points): neural network classifier

For the dataset simulated above fit neural networks with 1 through 6 nodes in a single hidden layer (use neuralnet implementation). For each of them calculate training error (see an example in Preface where it was calculated using err.fct field in the result returned by neuralnet). Simulate another independent dataset (with n=10,000 observations to make resulting test error estimates less variable) using the same procedure as above (3D standard normal, two classes, cube as a decision surface) and use it to calculate the test error at each number of hidden nodes. Plot training and test errors as function of the number of nodes in the hidden layer. What does resulting plot tells you about the interplay between model error, model complexity and problem geometry? What is the geometrical interpretation of this error behavior?

set.seed(999)

test = gendata(10000) #generate new data
train = data

neurn = function(train, test, plot = T, v = 0, ...){
#store error results
trainsse = numeric()
testerr = numeric()
trainerr = numeric()

for (h in 1:6){
#fit neural net
nn = neuralnet(cl ~ x + y + z, data = train, hidden = h, err.fct = 'sse', threshold = 0.1)

#training SSE error
trainsse[h] = nn$result.matrix[1]

#predict test & training
pred = predict(nn, test)
predtr = predict(nn, train)

#error results test & train
conf = confusionMatrix(data = as.factor(as.numeric(pred > 0.5)), reference = as.factor(test$cl))
testerr[h] = (1 - conf$overall[1]) * 100 #1 - accuracy as percentage

conftr = confusionMatrix(data = as.factor(as.numeric(predtr > 0.5)), reference = as.factor(train$cl))
trainerr[h] = (1 - conftr$overall[1]) * 100
}

#plot error results
if (plot == TRUE){ #optional plotting
par(mfrow = c(1, 3))
plot(trainsse, xlab = 'nodes', ylab = 'SSE', main = 'Training Error (SSE)', type = 'b')
plot(testerr, xlab = 'nodes', ylab = '1 - Accuracy (%)', main = 'Testing Error (%)', type = 'b', ylim = c(0, 100))
plot(trainerr, xlab = 'nodes', ylab = '1 - Accuracy (%)', main = 'Training Error (%)', type = 'b', ylim = c(0, 100))
title(paste("Test nObs:", nrow(test), " Train nObs:", nrow(train), " Null Variables:", v), outer = T, line = -33)
}
invisible(data.frame(testerr, trainerr)) #return error but suppress output
}

neurn(train, test)

As model complexity increases, testing and training error decreases, up to a certain point. At one node, error is high and decreases with the addition of more nodes. Testing error is slightly higher than training error. Error decreases while the number of nodes increases to 4, which is the minimum error as calculated by confusion matrix accuracy. When the model complexity increases to more than 4 nodes, testing and training error both increase slightly. Training error as calculated by neural network SSE continues to decrease with 5 and 6 nodes. Geometrically, this suggests that 4 planes/decision boundaries would effectively classify that data and the use of 6 would not significantly improve results.

5 Problem 3 (30 points): evaluate impacts of sample size and noise

Setup a simulation repeating procedure described above for n=100, 200 and 500 observations in the training set as well as adding none, 1, 2 and 5 null variables to the training and test data (and to the covariates in formula provided to neuralnet). Draw values for null variables from standard normal distribution as well and do not use them in the assignment of the observations to the class category (e.g. x<-matrix(rnorm(600),ncol=6); cl<-as.numeric(factor(rowSums(abs(x[,1:3])<1.25)==3)) creates dataset with three informative and three null variables). Repeat calculation of training and test errors at least several times for each combination of sample size, number of null variables and size of the hidden layer simulating new training and test dataset every time to assess variability in those estimates. Present resulting error rates so that the effects of sample size and fraction of null variables can be discerned and discuss their impact of the resulting model fits.

#function to add null variables
nulldata = function(null, nobs){
  d = gendata(nobs) #generate and classify informative variables

  if (null != 0){ #skip if not adding null variables
  for (v in 1:null){
    null = rnorm(nobs)
    d = data.frame(d, null) #add null variables to dataset
  }
  }
  return(d)
}
#nobs = 100, 200, 500 training
#null = 0, 1, 2, 5 training & test

nobs = c(100, 200, 500)
null = c(0, 1, 2, 5)
for (n in nobs){
  for (v in null){
    train = nulldata(v, n)
    test = nulldata(v, 10000)
    neurn(train, test)
  }
  
}

terrdf = numeric(6)
trerrdf = numeric(6)

for (i in 1:7){ #repeat neuralnet 7 times
nobs = c(100, 200, 500)
null = c(0, 1, 2, 5)
for (n in nobs){
  for (v in null){
    train = nulldata(v, n)
    test = nulldata(v, 10000)
    nn = neurn(train, test, plot = F)
    terrdf = data.frame(terrdf, nn$testerr) #dataframe of all error results
    trerrdf = data.frame(trerrdf, nn$trainerr)
  }
  
}
}
terrdf = terrdf[,-1]
trerrdf = trerrdf[,-1]

#List of plot titles
title = character()
nobs = c(100, 200, 500)
null = c(0, 1, 2, 5)
for (n in nobs){
  for (v in null){
    title = append(title, (paste("nObs:", n, " null:", v)))
  }}  

#separate error by nobs/null conditions & plot 
for (i in 0:6){ 
t = cbind(terrdf[1+i], terrdf[7+i], terrdf[14+i], terrdf[21+i], terrdf[28+i], terrdf[35+i] + terrdf[42+i])
boxplot(t(t), main = paste("Test Error", title[i+1]))

tr = cbind(trerrdf[1+i], trerrdf[7+i], trerrdf[14+i], trerrdf[21+i], trerrdf[28+i], trerrdf[35+i] + trerrdf[42+i])
boxplot(t(tr), main = paste("Training Error", title[i+1]), col = 'pink')
}

Error decreases with increasing number of observations. Increasing the amount of null variables; however, can increase the error and result in much more jagged error curves across the amount of nodes. In particular, 5 null variables shows worsening error rates.

6 Extra 10 points problem: model WiFi localization data

Use neuralnet to model the outcome in WiFi localization dataset that we used in previous weeks using outcome in its original, four-levels format. Your neuralnet models will need to have four output nodes to represent outcome with four levels. Obtain training and test error for this model for several sizes of the hidden layer. Compare resulting test error in predicting Location 3 to that observed for random forest, SVM and KNN approaches earlier in the course.

Session info

For reproducibility purposes it is always a good idea to capture the state of the environment that was used to generate the results:

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] caret_6.0-94         lattice_0.20-45      dplyr_1.1.0         
## [4] scatterplot3d_0.3-43 ggplot2_3.4.2        neuralnet_1.44.2    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10          lubridate_1.9.2      listenv_0.9.0       
##  [4] class_7.3-20         digest_0.6.31        ipred_0.9-14        
##  [7] foreach_1.5.2        utf8_1.2.3           parallelly_1.35.0   
## [10] R6_2.5.1             plyr_1.8.8           stats4_4.2.2        
## [13] hardhat_1.3.0        e1071_1.7-13         evaluate_0.20       
## [16] highr_0.10           pillar_1.8.1         rlang_1.1.0         
## [19] data.table_1.14.8    jquerylib_0.1.4      rpart_4.1.19        
## [22] Matrix_1.5-1         rmarkdown_2.20       splines_4.2.2       
## [25] gower_1.0.1          stringr_1.5.0        munsell_0.5.0       
## [28] proxy_0.4-27         compiler_4.2.2       xfun_0.37           
## [31] pkgconfig_2.0.3      globals_0.16.2       htmltools_0.5.4     
## [34] nnet_7.3-18          tidyselect_1.2.0     tibble_3.2.1        
## [37] prodlim_2023.03.31   codetools_0.2-18     fansi_1.0.4         
## [40] future_1.32.0        withr_2.5.0          ModelMetrics_1.2.2.2
## [43] MASS_7.3-58.3        recipes_1.0.5        nlme_3.1-160        
## [46] jsonlite_1.8.4       gtable_0.3.1         lifecycle_1.0.3     
## [49] magrittr_2.0.3       pROC_1.18.0          scales_1.2.1        
## [52] stringi_1.7.12       future.apply_1.10.0  cli_3.6.0           
## [55] cachem_1.0.6         reshape2_1.4.4       timeDate_4022.108   
## [58] bslib_0.4.2          generics_0.1.3       vctrs_0.6.1         
## [61] lava_1.7.2.1         iterators_1.0.14     tools_4.2.2         
## [64] glue_1.6.2           purrr_1.0.1          parallel_4.2.2      
## [67] fastmap_1.1.0        survival_3.4-0       yaml_2.3.7          
## [70] timechange_0.2.0     colorspace_2.1-0     knitr_1.42          
## [73] sass_0.4.5

The time it took to knit this file from beginning to end is about (seconds):

proc.time() - ptStart
##    user  system elapsed 
##  47.193   2.562  50.470